Several come from here: https://venngage.com/blog/bad-infographics/
2022-09-20
We provide some general principles we can use as a guide for effective data visualization.
Much of this section is based on a talk by Karl Broman titled Creating Effective Figures and Tables and includes some of the figures which were made with code that Karl makes available on his GitHub repository, as well as class notes from Peter Aldhous’ Introduction to Data Visualization course.
Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles.
We compare and contrast plots that follow these principles to those that don’t.
The principles are mostly based on research related to how humans detect patterns and make visual comparisons.
The preferred approaches are those that best fit the way our brains process visual information.
When deciding on a visualization approach, it is also important to keep our goal in mind.
We may be comparing a
We start by describing some principles for visually encoding numerical values. There are several approaches at our disposal including:
Example:
Suppose we want to report the results from two hypothetical polls regarding browser preference taken in 2000 and then 2015.
For each year, we are simply comparing five quantities – the five percentages for Opera, Safari, Firefox, IE, and Chrome.
Here we are representing quantities with both areas and angles, since both the angle and area of each pie slice are proportional to the quantity the slice represents.
This turns out to be a sub-optimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue.
The donut chart is an example of a plot that uses only area:
Can you determine the actual percentages and rank the browsers’ popularity?
Can you see how the percentages changed from 2000 to 2015?
A better approach is to simply show the numbers. It is not only clearer, but would also save on printing costs if printing a paper copy:
| Browser | 2000 | 2015 |
|---|---|---|
| Opera | 3 | 2 |
| Safari | 21 | 22 |
| Firefox | 23 | 21 |
| Chrome | 26 | 29 |
| IE | 28 | 27 |
Length is the best visual cue:
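A barplot of the table above, which encodes each percentage as the length of a bar, could be sketched as follows. This is a minimal sketch assuming the ggplot2 package is installed; the data frame name `browsers` and its column names are chosen here for illustration.

```r
library(ggplot2)

# Percentages copied from the table above; object and column names are illustrative.
browsers <- data.frame(
  browser = rep(c("Opera", "Safari", "Firefox", "Chrome", "IE"), times = 2),
  year = factor(rep(c(2000, 2015), each = 5)),
  percent = c(3, 21, 23, 26, 28, 2, 22, 21, 29, 27)
)

# Bars start at 0, so the length of each bar is proportional to the percentage.
p <- ggplot(browsers, aes(browser, percent)) +
  geom_col() +
  facet_grid(. ~ year) +
  ylab("Percent using the browser")
p
```

Placing the two years side by side on a common y-axis makes it possible to compare bar heights across panels.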
Label each pie slice with its respective percentage so viewers do not have to infer them from the angles or area:
When using barplots, it is misinformative not to start the bars at 0.
This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed.
By avoiding 0, relatively small differences can be made to look much bigger than they actually are.
This approach is often used by politicians or media organizations trying to exaggerate a difference.
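The size of the distortion can be worked out directly. The sketch below uses two hypothetical values (they are not taken from any of the examples in this section) and compares the true ratio with the ratio of bar lengths when the axis starts just below the smaller value.

```r
# Hypothetical values for two groups, for illustration only.
values <- c(a = 35, b = 40)

# True ratio of the two quantities.
true_ratio <- values[["b"]] / values[["a"]]

# Ratio of the visible bar lengths when the axis starts at 34 instead of 0.
axis_start <- 34
apparent_ratio <- (values[["b"]] - axis_start) / (values[["a"]] - axis_start)

round(true_ratio, 2)  # about 1.14
apparent_ratio        # 6
```

A difference of about 14% is drawn as a bar six times as long.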
Below is an illustrative example used by Peter Aldhous in this lecture.
(Source: Fox News, via Media Matters.)
Here is the correct plot:
Another example:
(Source: Fox News, via Flowing Data.)
And here is the correct plot:
One more example:
(Source: Venezolana de Televisión via Pakistan Today and Diego Mariano.)
Here is the appropriate plot:
When using position rather than length, it is not necessary to include 0.
This is particularly true when we want to compare differences between groups to the variability within groups.
During President Barack Obama’s 2011 State of the Union Address, the following chart was used to compare the US GDP to the GDP of four competing nations:
(Source: The 2011 State of the Union Address)
Here is a comparison of using radius versus area:
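To see why the choice matters, note that when the radius of a circle is made proportional to a value, its area grows with the square of that value. The sketch below uses hypothetical values, not the GDP figures from the chart, and the variable names are chosen here for illustration.

```r
# Hypothetical values; the largest is 4 times the smallest.
values <- c(small = 1, medium = 2, large = 4)

# Radius proportional to the value: area grows with the square of the value.
area_if_radius_scaled <- pi * values^2

# Area proportional to the value: radius grows with the square root of the value.
radius_if_area_scaled <- sqrt(values / pi)
area_if_area_scaled <- pi * radius_if_area_scaled^2

# Area ratios relative to the smallest circle.
area_if_radius_scaled / area_if_radius_scaled[["small"]]  # ratios 1, 4, 16
area_if_area_scaled / area_if_area_scaled[["small"]]      # ratios 1, 2, 4
```

With the radius encoding, a value that is 4 times larger is drawn with 16 times the area.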
Of course, in this case, we really should not be using area at all since we can use position and length:
When one of the axes is used to show categories, the default ggplot2 behavior is to order the categories alphabetically when they are defined by character strings.
If they are defined by factors, they are ordered by the factor levels.
We rarely want to use alphabetical order.
Instead, we should order by a meaningful quantity.
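The difference between the two orderings can be seen by inspecting factor levels directly. This is a small sketch with hypothetical categories and values; `reorder` is a base R function that sets the levels according to a second variable.

```r
# Hypothetical categories and values, for illustration only.
region <- c("West", "South", "Northeast", "Midwest")
rate <- c(2.5, 4.1, 1.8, 3.0)

# Character strings converted to a factor get alphabetical levels.
levels(factor(region))

# reorder() sets the levels by the value of a second variable (ascending).
levels(reorder(region, rate))
```

The first call returns the levels in alphabetical order, while the second returns them from the lowest rate to the highest.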
Note that the plot on the right is more informative:
We can make the second plot like this:
library(dplyr)
library(ggplot2)
library(dslabs)
data(murders)
# order states by murder rate (per 100,000) rather than alphabetically
murders |>
  mutate(murder_rate = total / population * 100000) |>
  mutate(state = reorder(state, murder_rate)) |>
  ggplot(aes(state, murder_rate)) +
  geom_col() +
  coord_flip() +
  theme(axis.text.y = element_text(size = 6)) +
  xlab("")
Here is another example:
We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups.
Suppose we want to describe the height data to an extra-terrestrial.
A commonly used plot, popularized by Microsoft Excel, is a barplot like this:
Showing the individual data points is more informative:
heights |> ggplot(aes(sex, height)) + geom_point()
But this plot can be improved with jitter, as there is much over-plotting:
heights |> ggplot(aes(sex, height)) + geom_jitter(width = 0.1, alpha = 0.2)
Since there are so many points, it is more effective to show distributions rather than individual points. We therefore show histograms for each group:
Use common axes
If horizontal comparison, stack graphs vertically
If vertical comparison, stack graphs horizontally
heights |> ggplot(aes(height, after_stat(density))) + geom_histogram(binwidth = 1, color="black") + facet_grid(sex~.)
Stack horizontally
heights |>
ggplot(aes(sex, height)) +
geom_boxplot(coef=3) +
geom_jitter(width = 0.1, alpha = 0.2) +
ylab("Height in inches")
Here is a terrible plot comparing population across continents:
Using a log transformation here provides a much more informative plot. Compare these two plots:
Other transformations to consider are:
logit
sqrt
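The effect of a log scale can be sketched with a few values that span several orders of magnitude. The populations below are hypothetical, and the object names are chosen here for illustration; the sketch assumes ggplot2 is installed. The last two lines show base R functions for the logit and square root transformations.

```r
library(ggplot2)

# Hypothetical populations (in millions), chosen to span several orders of magnitude.
pop <- data.frame(
  group = c("A", "B", "C", "D", "E"),
  population = c(0.5, 3, 40, 250, 1400)
)

# Original scale: the largest value dominates and the small ones are hard to tell apart.
p_linear <- ggplot(pop, aes(group, population)) + geom_point(size = 3)

# Log scale: equal distances correspond to equal fold changes.
p_log <- p_linear + scale_y_continuous(trans = "log10")

# Base R functions for the other two transformations.
sqrt(c(1, 4, 9))          # square root
qlogis(c(0.1, 0.5, 0.9))  # logit, that is log(p / (1 - p))
```

Printing `p_linear` and `p_log` one after the other shows how the small values separate once the axis is transformed.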
Note that it is hard to compare 1970 to 2010 by country:
It is much easier if they are adjacent:
The comparison becomes even easier to make if we use color to denote the two things we want to compare:
gapminder |>
filter(year %in% c(1970, 2010) & !is.na(gdp)) |>
mutate(dollars_per_day = gdp/population/365, year = factor(year)) |>
ggplot(aes(continent, dollars_per_day, fill = year)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(trans = "log2") +
ylab("Income in dollars per day")
An example of how we can use a color blind friendly palette is described here.
color_blind_friendly_cols <-
c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
Here are the colors:
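One way to display the palette and apply it to a plot is with `scale_color_manual` from ggplot2. In this sketch the palette is repeated so the block runs on its own, and the data frame `swatches` is a made-up helper with one point per color.

```r
library(ggplot2)

# Same palette as defined above, repeated so this block is self-contained.
color_blind_friendly_cols <-
  c("#999999", "#E69F00", "#56B4E9", "#009E73",
    "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# One point per color, so each of the eight colors shows up once.
swatches <- data.frame(x = 1:8, y = 1:8, col = as.character(1:8))

p <- ggplot(swatches, aes(x, y, color = col)) +
  geom_point(size = 5) +
  scale_color_manual(values = color_blind_friendly_cols)
p
```

The same `scale_color_manual` call can be added to any plot that maps a categorical variable with at most eight levels to color.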
There are several resources that can help you select colors, for example this one.
In general, you should use scatterplots to visualize the relationship between two variables.
However, there are some exceptions.
Slope charts add angle as a visual cue. They are useful when comparing each element of two groups across two values of a variable, such as two years.
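A slope chart can be sketched by drawing one line per item between the two time points. The data below are hypothetical and the column names are chosen here for illustration; the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical values for four items measured in two years.
dat <- data.frame(
  item = rep(c("A", "B", "C", "D"), times = 2),
  year = factor(rep(c(2010, 2015), each = 4)),
  value = c(70, 74, 78, 81, 73, 75, 77, 84)
)

# One line per item: the slope shows the direction and size of the change.
p <- ggplot(dat, aes(year, value, group = item)) +
  geom_line() +
  geom_point() +
  geom_text(data = subset(dat, year == "2010"), aes(label = item), nudge_x = -0.1)
p
```

This works well for a small number of items; with many items the lines overlap and a scatterplot is usually the better choice.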
The Bland-Altman plot shows the difference on the y-axis and the average on the x-axis.
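A plot of this kind, often called a Bland-Altman or mean-difference plot, can be sketched by computing the two quantities explicitly. The measurements below are hypothetical and the column names are chosen here for illustration.

```r
library(ggplot2)

# Hypothetical paired measurements for four items.
dat <- data.frame(
  item = c("A", "B", "C", "D"),
  before = c(70, 74, 78, 81),
  after = c(73, 75, 77, 84)
)

# The two plotted quantities: average of the pair and difference between them.
dat$average <- (dat$before + dat$after) / 2
dat$difference <- dat$after - dat$before

p <- ggplot(dat, aes(average, difference, label = item)) +
  geom_point() +
  geom_text(nudge_y = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed")
p
```

Dedicating an axis to the difference lets the viewer read it by position instead of inferring it from two separate points.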
We can use
different colors or shapes for categories
areas, brightness or hue for continuous values
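These encodings can be combined in a single scatterplot. The sketch below uses hypothetical data with made-up column names: two categorical variables are mapped to color and shape, and a continuous variable is mapped to point size.

```r
library(ggplot2)

# Hypothetical data: two positions plus three extra variables to encode.
dat <- data.frame(
  x = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
  y = c(60, 65, 71, 74, 79, 82),
  member = c("yes", "no", "no", "yes", "no", "no"),             # categorical
  region = c("East", "West", "East", "West", "East", "West"),   # categorical
  size_var = c(5, 40, 120, 15, 300, 80)                         # continuous
)

# Color and shape for the categories, area of the point for the continuous value.
p <- ggplot(dat, aes(x, y, color = region, shape = member, size = size_var)) +
  geom_point()
p
```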
Below is an example that encodes three variables: OPEC membership, region, and population.
When selecting colors to quantify a numeric variable, we choose between two options: sequential and diverging.
Sequential colors are suited for data that goes from high to low. High values are clearly distinguished from low values. Here are some examples offered by the package RColorBrewer:
library(RColorBrewer)
display.brewer.all(type="seq")
Diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range: higher than the center and lower than the center.
library(RColorBrewer)
display.brewer.all(type="div")
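To use one of these palettes in a plot, the colors can be extracted with `brewer.pal`. The palette names "Blues" and "RdBu" are picked here only as examples of a sequential and a diverging palette.

```r
library(RColorBrewer)

# Nine-step sequential palette: light for low values, dark for high values.
seq_cols <- brewer.pal(9, "Blues")

# Eleven-step diverging palette: a neutral middle with a distinct hue at each end.
div_cols <- brewer.pal(11, "RdBu")

seq_cols
div_cols
```

The resulting vectors of hex codes can be passed to a ggplot2 scale, for example `scale_fill_gradientn(colors = div_cols)`. For a diverging palette, the scale limits should be symmetric around the center so that the neutral color lines up with it.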
The figure below, taken from the scientific literature, shows three variables: dose, drug type and survival.
(Image courtesy of Karl Broman)
Try to determine the values of the survival variable in the previous plot.
Can you tell when the purple ribbon intersects the red one?
This is an example in which we can easily use color to represent the categorical variable instead of using pseudo-3D:
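A flat version of such a plot can be sketched with one line per drug, distinguished by color. The values below are hypothetical and are not the data behind the figure; the column names are chosen here for illustration.

```r
library(ggplot2)

# Hypothetical dose-response values for three drugs; not the data in the figure.
dat <- data.frame(
  log_dose = rep(1:5, times = 3),
  drug = rep(c("A", "B", "C"), each = 5),
  survival = c(0.95, 0.90, 0.80, 0.60, 0.40,
               0.90, 0.80, 0.65, 0.45, 0.30,
               0.85, 0.82, 0.70, 0.55, 0.20)
)

# Color encodes the categorical variable; position encodes dose and survival.
p <- ggplot(dat, aes(log_dose, survival, color = drug)) +
  geom_line() +
  geom_point()
p
```

With both numeric variables on the axes, survival values and the points where curves cross can be read directly.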
Pseudo-3D is sometimes used completely gratuitously: plots are made to look 3D even when the 3rd dimension does not represent a quantity. This only adds confusion and makes it harder to relay your message. We show two examples:
(Images courtesy of Karl Broman)
By default, statistical software like R returns many significant digits.
The default behavior in R is to show 7 significant digits.
That many digits often adds no information and the added visual clutter can make it hard for the viewer to understand the message.
As an example, here are the per 10,000 disease rates, computed from totals and population in R, for California across the five decades:
| state | year | Measles | Pertussis | Polio |
|---|---|---|---|---|
| California | 1940 | 37.8826320 | 18.3397861 | 0.8266512 |
| California | 1950 | 13.9124205 | 4.7467350 | 1.9742639 |
| California | 1960 | 14.1386471 | NA | 0.2640419 |
| California | 1970 | 0.9767889 | NA | NA |
| California | 1980 | 0.3743467 | 0.0515466 | NA |
Here is the same table rounded to three decimal places:
| state | year | Measles | Pertussis | Polio |
|---|---|---|---|---|
| California | 1940 | 37.883 | 18.340 | 0.827 |
| California | 1950 | 13.912 | 4.747 | 1.974 |
| California | 1960 | 14.139 | NA | 0.264 |
| California | 1970 | 0.977 | NA | NA |
| California | 1980 | 0.374 | 0.052 | NA |
And here it is rounded to one decimal place:
| state | year | Measles | Pertussis | Polio |
|---|---|---|---|---|
| California | 1940 | 37.9 | 18.3 | 0.8 |
| California | 1950 | 13.9 | 4.7 | 2.0 |
| California | 1960 | 14.1 | NA | 0.3 |
| California | 1970 | 1.0 | NA | NA |
| California | 1980 | 0.4 | 0.1 | NA |
Useful ways to change the number of significant digits or to round numbers are
signif
round
You can define the number of significant digits globally by setting options like this: options(digits = 3).
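The difference between the two functions can be seen on a few of the rates from the tables above. This is a minimal sketch using base R only; the variable name `rate` is chosen here for illustration.

```r
# Three of the California rates from the first table above.
rate <- c(37.8826320, 0.8266512, 0.0515466)

# signif() keeps a fixed number of significant digits.
signif(rate, 3)

# round() keeps a fixed number of digits after the decimal point.
round(rate, 1)

# format() with digits controls how a single value is printed
# without changing the stored value.
format(rate[1], digits = 3)
```

Note that `options(digits = 3)` affects how numbers are printed, not how they are stored, so computations still use full precision.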
Another principle related to displaying tables is to place values being compared on columns rather than rows. Compare these two presentations:
| state | disease | 1940 | 1950 | 1960 | 1970 | 1980 |
|---|---|---|---|---|---|---|
| California | Measles | 37.9 | 13.9 | 14.1 | 1.0 | 0.4 |
| California | Pertussis | 18.3 | 4.7 | NA | NA | 0.1 |
| California | Polio | 0.8 | 2.0 | 0.3 | NA | NA |
Here each disease's rates run down a single column, which makes them easier to compare across years:
| state | year | Measles | Pertussis | Polio |
|---|---|---|---|---|
| California | 1940 | 37.9 | 18.3 | 0.8 |
| California | 1950 | 13.9 | 4.7 | 2.0 |
| California | 1960 | 14.1 | NA | 0.3 |
| California | 1970 | 1.0 | NA | NA |
| California | 1980 | 0.4 | 0.1 | NA |
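Switching between the two layouts can be sketched with the tidyr package. This assumes tidyr is installed; the object names are chosen here for illustration, and the values are copied from the first presentation above.

```r
library(tidyr)

# First presentation: one row per disease, one column per year.
by_year_cols <- data.frame(
  state = "California",
  disease = c("Measles", "Pertussis", "Polio"),
  `1940` = c(37.9, 18.3, 0.8),
  `1950` = c(13.9, 4.7, 2.0),
  `1960` = c(14.1, NA, 0.3),
  `1970` = c(1.0, NA, NA),
  `1980` = c(0.4, 0.1, NA),
  check.names = FALSE
)

# Second presentation: one row per year, one column per disease.
by_disease_cols <- by_year_cols |>
  pivot_longer(cols = -c(state, disease), names_to = "year", values_to = "rate") |>
  pivot_wider(names_from = disease, values_from = rate)

by_disease_cols
```

After reshaping, the years being compared for each disease sit in the same column.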
Graphs can be used
for our own exploratory data analysis,
to convey a message to experts, or
to help tell a story to a general audience.
Make sure that the intended audience understands each element of the plot.